Pathological voice quality assessment using artificial neural networks
Abstract
This paper describes a prototype system for the objective assessment of voice quality in patients recovering from various stages of laryngeal cancer. A large database of male subjects steadily phonating the vowel /i/ was used in the study, and the quality of their voices was independently assessed by a speech and language therapist (SALT) according to their 7-point ranking of subjective voice quality. The system extracts salient short-term and long-term time-domain and frequency-domain parameters from impedance (EGG) signals, and these are used to train and test an Artificial Neural Network (ANN). Multi-layer Perceptron (MLP) ANNs were investigated using various combinations of these parameters, and the best results were obtained using a combination of short-term and long-term parameters, for which an accuracy of 92% was achieved. It is envisaged that this system could be used as a screening tool and provide a valuable aid to the SALT during clinical evaluation of voice quality.
Introduction
An increasingly important factor in prescribing treatment for cancer of the larynx is the quality of voice retained post-therapy. At present, Speech and Language Therapists (SALTs) endeavour to rehabilitate a patient's voice back to normality, or as near normal as possible, quickly following treatment. They currently assess voice quality on a 7-point ranking (0 = least abnormal, 6 = most abnormal) based on a variety of sound parameters. Some of these, such as shimmer and jitter, are well defined, while others, such as whisper and creak, are descriptive or have tenuous physical correlates. As a result, the assessment is largely subjective and depends upon the experience of the SALT. This situation would clearly be improved by the availability of an objective voice-quality assessment system that can provide accurate, reproducible, graded measures of a patient's voice quality to help the SALT plan the patient's rehabilitation.
Earlier work has shown that an MLP trained using features derived from a normalised power spectral representation, the Fundamental-Harmonic Normalised (FHN) spectrum [1], of stationary vowel segments can classify EGG speech signals as normal or abnormal with an accuracy of 80% [2]. Whilst this provided good classification between normal and abnormal voice quality, the feature set limited the system to sub-optimal classification results, since it is well known that some pathologies are measured more easily using long-term (>50 ms) parameters [3]. This paper describes the refinement of the ANN approach to voice quality assessment by introducing long-term features to the prototype classification system. In addition, the extension of the system to provide a sub-classification of abnormal voices in line with the SALT 7-point ranking scheme is investigated, and preliminary results are presented.
Data Capture
The data used to develop the system were captured off-line under clinical conditions at the Christie and Withington Hospitals in Manchester, using an Electrolaryngograph PCLX system [4]. This system captures electrical impedance (EGG) signals using pads placed either side of the neck, synchronously with acoustic signals captured using a microphone. Both the EGG and acoustic data channels were captured at 20 kHz for up to 3 seconds while the subject phonated the vowel /i/ as steadily as possible. Although speech data were recorded for both male and female patients, the largest pathological group was male, so it is these speech signals that were used in the study. For each patient the SALT made a subjective voice quality assessment using a 7-point ranking.
MAVEBA 2001, Firenze, Italy
Data Processing
A voicing analysis was performed upon each 3-second EGG and acoustic speech signal to determine whether the subject had voiced during phonation.
If voicing was considered to have occurred, the EGG signal was processed to extract first the long-term features and then the short-term features for classification of voice quality. The long-term features are the mean fundamental frequency, f0 (Mf0), the standard deviation of f0 (SDf0), and the percentage of the signal that is voiced (V+), while the short-term features include parameters related to the structure of the first few harmonics and the glottal noise.
The voicing test involved taking 50 ms frames from the signals and applying cepstral analysis techniques [5] to identify the voiced frames. Each frame was then pre-emphasised by forward differencing to suppress the effects of drifting signal amplitude, and its autocovariance was multiplied by a Hanning-Tukey window prior to transformation to the frequency domain using the Fast Fourier Transform. An estimate of f0 for each frame, deduced during the voicing analysis, was used to derive the FHN normalised spectral representation. This process removed any inter-patient variability in f0 and its harmonics, allowing more effective modelling of the spectral envelope among groups of patients.
Once the FHN spectrum had been determined, Gaussians were fitted to the data around f0 and its first few harmonics. Each Gaussian, $G_h$ ($h = 0$ up to typically 8), was parameterised as: $G_h = (\mathrm{position}_h, \mathrm{width}_h, \mathrm{amplitude}_h)$. It was observed that the mixture of Gaussians gave a better 'fit' to the FHN spectrum for the less abnormal patients, and so a parameter related to goodness of fit, called the Harmonic Linearity Measure (HLM), was calculated for each frame. Finally, as glottal noise is considered to be an important measure of voice quality, a parameter, FHNNE, based on the Normalised Noise Energy (NNE) [6] but derived from the FHN spectrum, was calculated for the data.
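As a rough illustration of the voicing test and the long-term features described above, the following Python sketch estimates f0 per 50 ms frame from the real cepstrum and then derives Mf0, SDf0 and V+. It is a minimal sketch, not the authors' implementation: it assumes NumPy, and the function names, the Hann window, the 60 to 400 Hz search range, the peak threshold of 0.1 and the use of the population standard deviation are all illustrative assumptions. Pre-emphasis, the FHN normalisation and the Gaussian fitting are omitted.

```python
import numpy as np

FS = 20_000        # sampling rate stated in the text (20 kHz)
FRAME_LEN = 1_000  # 50 ms frames at 20 kHz


def cepstral_f0(frame, fs=FS, f0_min=60.0, f0_max=400.0, threshold=0.1):
    """Return an f0 estimate in Hz, or None if the frame looks unvoiced.

    f0_min, f0_max and threshold are assumed values chosen for illustration.
    """
    windowed = frame * np.hanning(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_spectrum)   # real cepstrum; index = quefrency in samples
    q_lo = int(fs / f0_max)                 # shortest pitch period searched
    q_hi = int(fs / f0_min)                 # longest pitch period searched
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    if cepstrum[peak] < threshold:          # weak peak: treat the frame as unvoiced
        return None
    return fs / peak


def long_term_features(signal, fs=FS, frame_len=FRAME_LEN):
    """Compute Mf0, SDf0 and V+ over non-overlapping frames of the signal."""
    n_frames = len(signal) // frame_len
    f0_values = []
    for i in range(n_frames):
        f0 = cepstral_f0(signal[i * frame_len:(i + 1) * frame_len], fs)
        if f0 is not None:
            f0_values.append(f0)
    voiced_pct = 100.0 * len(f0_values) / n_frames if n_frames else 0.0
    if not f0_values:
        return {"Mf0": float("nan"), "SDf0": float("nan"), "V+": voiced_pct}
    return {
        "Mf0": float(np.mean(f0_values)),   # mean f0 over voiced frames
        "SDf0": float(np.std(f0_values)),   # population SD of f0 over voiced frames
        "V+": voiced_pct,                   # percentage of frames judged voiced
    }


if __name__ == "__main__":
    # Synthetic check: 1 s pulse train with a 160-sample period, then 1 s of silence.
    test_signal = np.zeros(2 * FS)
    test_signal[:FS:160] = 1.0
    print(long_term_features(test_signal))
```

In practice the threshold and search range would need tuning against recordings that have been labelled as voiced or unvoiced.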
The data extracted from the speech signals and used for the ANN classification tests comprised 3 long-term parameters (Mf0, SDf0, V+) and 17 short-term parameters (G1, G2, G3, G4, G5, HLM, FHNNE). Full details of the data processing and extraction of these parameters can be found in McGillion [7].
Data Classification
A total of 77 abnormal speech signals were available for training and testing. For each of the 7 classes, 450 patterns were used for training/validation and 200 for testing. Unfortunately, as a result of the relatively small dataset, there were different numbers of patients in each class. As it is desirable to have equal numbers in each class to train an ANN adequately, additional frames were taken from some patients, and a small percentage of the data was artificially generated by adding normally distributed noise to the short-term features of the existing patterns within each class. A two-layer 7-output MLP was trained using the back-propagation training algorithm, a softmax activation function, and a cross-entropy error function. The advantage of using the softmax activation function with the cross-entropy error function is that the outputs across all seven classes sum to 1.0 and can therefore be interpreted as the probability of membership of each of the seven classes, assuming equal prior probabilities. A further constraint placed upon the MLP is that, for any single class to be declared the 'winner', the output for that class must be greater than 50% (0.5). MLP structures with different numbers of hidden units and subsets of the 21 input parameters were investigated in order to determine the combination that provided the minimum classification error.

Table 1. Test results for the seven-class ANN

Best Individual MLP (structure 20-25-7; accuracy 92.00%, SD 6.42)
Inputs: G1, G2, G3, G4, G5, FHNNE, HLM, Mf0, SDf0, V+
                  Class 0  Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Sensitivity (%)      98.5     96.0     94.5     80.5     86.5     91.5     96.5
Specificity (%)      0.12     0.87     0.75     4.12      3.0     2.12     0.75

Best Individual Structure (structure 20-40-7; accuracy 90.30%, SD 1.65)
Inputs: G1, G2, G3, G4, G5, FHNNE, Mf0, SDf0, V+
                  Class 0  Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Sensitivity (%)      98.1     94.6     93.1     86.9     83.6     81.7     94.1
Specificity (%)      0.47      1.3     1.42      3.1     3.82     4.17     1.37

Best Input Set (structure 21-[15,25,40]-7; accuracy 87.24%, SD 3.47)
Inputs: G1, G2, G3, G4, G5, f0, FHNNE, HLM, Mf0, SDf0, V+
                  Class 0  Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Sensitivity (%)      98.0     93.5     92.2     81.1     77.3     74.8     93.5
Specificity (%)      0.45     1.49     1.52     4.02     4.67     5.28     1.25

Results and Discussion
An overview of the best results obtained from the different combinations of input parameters and hidden units is shown in Table 1. The best input set is the set of input parameters that provided the best classification results, regardless of weight initialisation and the number of hidden units. The best individual structure is the MLP that provided the best classification results taking into account the number of hidden units, but disregarding the weight initialisations. The best individual MLP is that which takes into account both the number of hidden units and the weight initialisation. The best individual MLP can differ from the best individual structure and the best input set if the variance in the classification ability of a given MLP is large (due to the random weight initialisations): although an individual MLP might produce a very high accuracy, the average performance can be very poor. Therefore, it is important to consider all three results when determining the best performing input set and MLP structure.
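The output stage and decision rule described in the data classification section (softmax outputs that sum to 1.0, a cross-entropy error, and a winner declared only when its output exceeds 0.5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the input is taken to be the seven output-layer activations of an already trained MLP.

```python
import math


def softmax(activations):
    """Map raw output activations to values that sum to 1.0."""
    shift = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_class):
    """Cross-entropy error for one pattern with a 1-of-N target."""
    return -math.log(probabilities[target_class])


def decide(probabilities, threshold=0.5):
    """Return the winning class index, or None if no output exceeds the threshold."""
    winner = max(range(len(probabilities)), key=probabilities.__getitem__)
    return winner if probabilities[winner] > threshold else None


if __name__ == "__main__":
    confident = softmax([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    ambiguous = softmax([0.0] * 7)
    print(decide(confident), decide(ambiguous))
```

Because the outputs sum to 1.0, at most one class can exceed 0.5, so the rule either names a single winner or leaves the pattern unassigned.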
The best overall ANN structure was a 20-25-7 MLP using the parameters [G1, G2, G3, G4, G5, FHNNE, HLM, Mf0, SDf0, V+], and the results indicate that this MLP was able to distinguish between the seven abnormal groups with an accuracy of up to 92%. Figure 1 shows the output of the MLP for each abnormal class. The output of the MLP is an estimate of the posterior probability of membership for each class. It should be noted that there are only two cases where the output falls below the majority threshold of 0.5, and only one misclassification (for class 3). Perhaps unsurprisingly, the classes at the two extremes of the scale, 0 and 6, provide the best classification results. In all cases, classes 3, 4, and 5 are the most difficult to discriminate between.
All the short-term features were found to contribute to the classification. The classification accuracy increased from 26.5% with [G1] alone to 67.7% with [G1, G2, G3, G4, G5]. Adding the other short-term features [FHNNE] and [HLM] increased the discrimination ability of the MLP to 72.07% and 68.64% respectively. Similarly, the long-term features were found to be very important to the discrimination between the classes. The parameters [Mf0], [SDf0], and [V+] alone were able to distinguish between the classes with accuracies of 37.57%, 23.07%, and 26.78% respectively. However, as can be seen from the results, it is the combination of the short-term and long-term features that provides the most accurate classification of the abnormal signals.
Conclusions
The results from this work suggest that a voice quality assessment system incorporating an ANN can be trained to provide objective sub-classifications of voice quality in line with the 7-point ranking scheme used by the SALT. However, it should be noted that the ANN has been trained on the assessments of one SALT, which could lead to subjectively biased results.
The collection of patient speech data, including voice quality rankings from several SALTs in the region, is now taking place, and will hopefully provide a larger and less biased dataset for training the system. At the same time, work is taking place to identify and evaluate other parameters that can be derived from the speech data, in particular the acoustic data, which has been largely ignored in this study so far, in order to further improve the accuracy and reproducibility of the system.
Acknowledgements
The support of this work by the EPSRC award GR/L51546 is greatly appreciated.
References
[1] Moore CJ, Slevin N, Winstanley S (1999) Characterising vowel phonation by fundamental spectral normalisation of LX-waveforms. Proceedings of the International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 1:1-6.
[2] Ritchings RT, McGillion M, Conroy G, Moore C (1999) Objective assessment of pathological voice quality. Proc IEEE SMC99, 2:340-345. ISBN: 0-7803-5683-7.
[3] Baken RJ (1992) Electroglottography. Journal of Voice 6(2):98-100.
[4] Fourcin AJ, Abberton E, Miller D, Howell D (1996) Laryngograph: Speech pattern element tools for therapy, training and assessment. European Journal of Disorders of Communication 30(2):101-115.
[5] Noll A (1967) Cepstrum pitch determination. Journal of the Acoustical Society of America 41:293-309.
[6] Kasuya H, Ogawa S, Mashima K, Ebihara S (1986) Normalised noise energy as an acoustic measure to evaluate pathologic voice. Journal of the Acoustical Society of America 80(5):1329-1334.
[7] McGillion (2000) Automated Analysis of Voice Quality. Ph.D. Thesis, UMIST.
Figure 1. The ANN classification performance for each of the 7 classes. (Recovered panel titles: MLP classification of class 0 abnormals; MLP classification of class 1 abnormals. Horizontal axis: class, 0 to 6; vertical axis: 0.0 to 1.0.)